The theme for this week is clustering and classification. This is something totally new to me, so let's see what I learn this week.
In this exercise we examine the Boston dataset from the R MASS package, which describes housing values in the suburbs of Boston. The data frame has 506 observations of 14 variables.
library(MASS)
# loading the Boston dataset from the MASS package
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
So we have 506 suburbs of Boston, and here are the explanations of the different variables:
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in $1000s
# loading the packages I might use in this exercise
library(ggplot2)
library(GGally)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------ tidyverse 1.2.0 --
## ✓ tibble 1.3.4     ✓ purrr   0.2.4
## ✓ tidyr  0.7.2     ✓ dplyr   0.7.4
## ✓ readr  1.1.1     ✓ stringr 1.2.0
## ✓ tibble 1.3.4     ✓ forcats 0.2.0
## -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
library(corrplot)
## corrplot 0.84 loaded
# checking the summary of the Boston dataset
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Here is the summary of the data. chas is a binary variable. The other variables, except for crim and zn, seem to be fairly symmetrically distributed (mean and median are more or less close to each other).
Let's also make a graph.
#graphical overview
pairs(Boston)
With the pairs plot it's not very easy to see any correlations. Let's study the correlations between the variables with a correlation plot.
# making a correlation matrix and drawing a correlation plot to visualise the correlations better and make interpretation easier
cormatrix <- cor(Boston)
cormatrix %>% round(digits=2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot(cormatrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
There is a strong negative correlation (big red ball) between dis-nox, dis-age and dis-indus, meaning that moving further from the Boston employment centres, the nitrogen oxides concentration goes down and the proportion of owner-occupied units built prior to 1940 goes down. This makes sense.
Also the lower status of the population (lstat) and the median value of owner-occupied homes (medv) have a strong negative correlation. When the percentage of lower-status population gets bigger, the median value of owner-occupied homes in $1000s gets smaller. This is also understandable.
rad and tax have a strong positive correlation (0.91), meaning that when the index of accessibility to radial highways rises, the full-value property-tax rate per $10,000 also rises. Why not?
Let's move further with the analysis.
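Instead of eyeballing the plot, the strongest pairs can also be listed programmatically. A minimal sketch (my own addition, not part of the original analysis) that ranks all variable pairs by absolute correlation:

```r
# list the strongest pairwise correlations in the Boston data
library(MASS)

cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
pairs_df <- na.omit(as.data.frame(as.table(cm)))
pairs_df <- pairs_df[order(-abs(pairs_df$Freq)), ]
head(pairs_df, 5)                             # top of the list: rad-tax at 0.91
```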
I need to standardise the dataset so that all variables are on a comparable scale (mean 0, standard deviation 1). I print the summary of the scaled data set.
#Standardising the dataset
boston_scaled <- scale(Boston)
#printing out summaries of the scaled data
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# my boston_scaled data is a matrix, so I convert it to a data frame for later use
class(boston_scaled)
## [1] "matrix"
boston_scaled<- as.data.frame(boston_scaled)
Now we have scaled the data, and as seen in the summary, all the means are exactly 0 and the variables are on the same scale. Scaling does not change the shape of the distributions, but it makes the variables comparable, so this data can be fitted in a model.
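What scale() actually does is subtract the column mean and divide by the column standard deviation. A quick sketch verifying this by hand on one variable:

```r
# scale() is equivalent to (x - mean(x)) / sd(x), column by column
library(MASS)

manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
auto   <- scale(Boston)[, "crim"]
all.equal(as.numeric(auto), manual)   # TRUE
```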
Next I will change the continuous crime rate variable in my data set to be a categorical variable. I want to cut the crim variable by quantiles to get the high, low and middle rates of crime into their own categories. I drop the old crim variable from the dataset and replace it with the new crime variable.
#create a quantile vector
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#and create a categorical variable crime
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
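Cutting at the quantiles is why the four categories end up roughly the same size. A small self-contained sketch of the same idea on toy data:

```r
# cut() with quantile breaks splits a vector into (roughly) equal-sized categories
x <- 1:100
cats <- cut(x, breaks = quantile(x), include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)   # 25 observations per category
```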
library(dplyr)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
dim(boston_scaled)
## [1] 506 14
Now I need to divide the dataset into train and test sets, so that 80% of the data belongs to the train set.
#Here I make the train and test sets: 80% of the observations go to the train set and the rest to the test set
dim(boston_scaled)
## [1] 506 14
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)
#ind is a plain vector of row indices, so dim() returns NULL; length() would give its size
dim(ind)
## NULL
#create the train set
train <- boston_scaled[ind,]
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 -0.487 -0.487 2.943 -0.487 ...
## $ indus : num -0.867 -0.376 0.406 -0.902 2.42 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.343 -0.299 -1.016 -1.24 0.469 ...
## $ rm : num -0.352 0.27 -0.392 0.82 -1.239 ...
## $ age : num -1.211 1.013 -0.933 -1.445 1.056 ...
## $ dis : num 1.04 -0.647 0.811 0.628 -0.969 ...
## $ rad : num -0.522 -0.522 -0.637 -0.637 -0.637 ...
## $ tax : num -1.093 -0.144 -0.707 -0.969 1.796 ...
## $ ptratio: num 0.806 1.129 -1.134 0.344 0.76 ...
## $ black : num 0.441 0.422 0.441 0.441 -0.138 ...
## $ lstat : num -0.6502 -0.0536 -0.3155 -1.3056 1.5848 ...
## $ medv : num -0.1558 -0.2971 -0.0906 0.6488 -1.6889 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 2 1 1 2 4 1 4 3 3 ...
#create the test set
test <- boston_scaled[-ind,]
str(test)
## 'data.frame': 102 obs. of 14 variables:
## $ zn : num 0.0487 0.0487 -0.4872 -0.4872 -0.4872 ...
## $ indus : num -0.476 -0.476 -0.437 -0.437 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.265 -0.265 -0.144 -0.144 -0.144 ...
## $ rm : num -0.388 -0.563 -0.794 -1.017 0.554 ...
## $ age : num -0.0702 -1.0507 0.0329 1.0489 0.6652 ...
## $ dis : num 0.838414 0.786365 0.000692 0.001357 0.210835 ...
## $ rad : num -0.522 -0.522 -0.637 -0.637 -0.637 ...
## $ tax : num -0.577 -0.577 -0.601 -0.601 -0.601 ...
## $ ptratio: num -1.5 -1.5 1.18 1.18 1.18 ...
## $ black : num 0.426 0.371 0.375 0.218 0.258 ...
## $ lstat : num -0.0312 0.4281 -0.1923 1.1717 -0.0943 ...
## $ medv : num 0.0399 -0.0906 -0.4711 -0.9713 -0.1667 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 2 2 3 3 3 3 1 2 2 2 ...
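One thing worth noting: the split relies on random sampling without a seed, so every knit produces a different split. A minimal sketch of a reproducible version (the seed value 123 is my own choice, not from the original):

```r
set.seed(123)                      # fix the random number generator for reproducibility
n   <- 506                         # number of rows in boston_scaled
ind <- sample(n, size = n * 0.8)   # size is truncated to 404
length(ind)                        # 404 observations go to the train set
```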
Now I fit a linear discriminant analysis on the train set. I use the categorical crime rate as the target variable and all the other variables as predictors. I draw the LDA (bi)plot of the model.
# fit the linear discriminant analysis on the train set. crime as the target variable and all the other variables as predictor variables
lda.fit <- lda(formula = crime ~ ., data = train)
# helper function for drawing the lda arrows on the biplot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# drawing a plot of the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
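With four classes, the model has three linear discriminants, and their relative importance can be read from the singular values of the fitted lda object. A sketch shown on a rebuilt crime classification of the full scaled Boston data (my own stand-in, since the random train split above is not reproducible):

```r
# proportion of between-class separation captured by each discriminant
library(MASS)

b <- as.data.frame(scale(Boston))
b$crime <- cut(b$crim, breaks = quantile(b$crim), include.lowest = TRUE,
               labels = c("low", "med_low", "med_high", "high"))
b$crim <- NULL
fit <- lda(crime ~ ., data = b)
round(fit$svd^2 / sum(fit$svd^2), 3)   # LD1 dominates the separation
```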
Then I predict the classes with the LDA model on my test data and cross tabulate the results with the crime categories from the test set.
#saving the crime categories from the test set
correct_classes <- test$crime
library(dplyr)
#removing the categorical crime variable from the test dataset
test <- dplyr::select(test, -crime)
#predicting the classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)
#cross-tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 7 1 0
## med_low 9 13 7 0
## med_high 0 6 18 1
## high 0 0 0 25
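The overall accuracy can be computed from the cross-tabulation as the share of predictions on the diagonal. A sketch using the confusion matrix printed above (the exact numbers depend on the random split):

```r
# overall accuracy = correct predictions / all predictions
conf <- matrix(c(15,  7,  1,  0,
                  9, 13,  7,  0,
                  0,  6, 18,  1,
                  0,  0,  0, 25), nrow = 4, byrow = TRUE)
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 2)   # about 0.70 for this particular split
```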
Here I can see how well my model predicts. It works well for the high crime rates, but it makes some errors predicting the other classes. The same phenomenon was visible in the train set plot.
Next I move on to clustering and measure the distances between observations. I use the Euclidean distance, which is probably the most common distance measure. K-means is an old and widely used clustering method. kmeans computes the distances automatically, but I have to choose the number of clusters myself. I first tried the model with 4 clusters, but for me it seems that 3 clusters works better.
#Reloading the Boston dataset and standardising it (so that all variables are on a comparable scale)
dim(Boston)
## [1] 506 14
scale_Boston2 <- scale(Boston)
scale_Boston2 <- as.data.frame(scale_Boston2)
#Calculating the distances with the Euclidean distance measure
dist_eu <- dist(scale_Boston2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
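As a reminder of what dist() is computing: the Euclidean distance between two observations is the square root of the summed squared differences over all variables. A tiny sketch with made-up numbers:

```r
# Euclidean distance by hand vs. with dist()
x <- c(1, 2, 3)
y <- c(4, 6, 3)
sqrt(sum((x - y)^2))                      # 5
dist(rbind(x, y), method = "euclidean")   # same result for the row pair
```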
# k-means clustering
km <-kmeans(scale_Boston2, centers = 3)
# plotting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)
pairs(scale_Boston2[1:6], col = km$cluster)
pairs(scale_Boston2[7:13], col = km$cluster)
Next I investigate the optimal number of clusters. There are many ways to find it, but here I use the total within-cluster sum of squares (WCSS) and visualise the result with a plot.
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is where the total WCSS drops radically, and above I can see that this happens around k = 2. So the optimal number of clusters would be 2. Next I run the algorithm again with two clusters.
# k-means clustering
km <-kmeans(scale_Boston2, centers = 2)
# plot the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)
pairs(scale_Boston2[1:8], col = km$cluster)
pairs(scale_Boston2[6:13], col = km$cluster)
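Besides the pairs plots, one way to characterise the two clusters is to compare the average (scaled) value of each variable per cluster. A sketch (the set.seed call is my own addition; kmeans starts from random centres, so the cluster labels may swap between runs):

```r
# compare cluster means on the standardised variables
library(MASS)

set.seed(42)                                 # assumed seed, for reproducibility
boston_std <- as.data.frame(scale(Boston))
km <- kmeans(boston_std, centers = 2)
round(aggregate(boston_std, by = list(cluster = km$cluster), FUN = mean), 2)
```

Variables where the two rows differ most (for example crim, nox, rad and tax in my runs) are the ones driving the cluster separation.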
Interpretation
The big pairs plot with all the variables is difficult to interpret, so I also made two more detailed plots to be able to study some effects more closely. In the first plot (where all the variables are included) I can see that the variable chas doesn't follow any pattern with any of the other variables. Many other pairs don't follow a clear pattern either. However, I think I can see a negative correlation between indus-dis, nox-dis and dis-lstat, and a positive correlation between indus-nox, age-nox and age-lstat.